scikit-learn: permutation_test_score

scikit-learn (not just mlxtend) also has a function for testing the significance of a cross-validation result.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.permutation_test_score.html

Evaluate the significance of a cross-validated score with permutations.

That is, it uses permutations to assess the significance of a cross-validated score.

Permutes targets to generate ‘randomized data’ and compute the empirical p-value against the null hypothesis that features and targets are independent.

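In other words, the targets are shuffled to produce 'randomized data', and an empirical p-value is computed against the null hypothesis that the features and the targets are independent.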
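To make the idea concrete, here is a hand-rolled sketch of the procedure described above. It is not scikit-learn's implementation; the dataset (iris), the classifier, the fold setup and the number of permutations are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative choices only: any dataset/classifier pair would do.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rng = np.random.default_rng(0)

# Cross-validated score on the real targets.
true_score = cross_val_score(clf, X, y, cv=cv).mean()

# Cross-validated scores on shuffled targets. Shuffling y breaks any
# link between features and targets, which simulates the null hypothesis.
n_permutations = 30
perm_scores = np.array([
    cross_val_score(clf, X, rng.permutation(y), cv=cv).mean()
    for _ in range(n_permutations)
])

# Empirical p-value: share of shuffled runs that score at least as well
# as the real run, with a +1 correction in numerator and denominator.
C = int(np.sum(perm_scores >= true_score))
p_value = (C + 1) / (n_permutations + 1)
print(f"true score={true_score:.3f}, "
      f"mean permuted score={perm_scores.mean():.3f}, p={p_value:.3f}")
```

The real function does the same kind of work in one call, so this loop is only useful for understanding what is being measured.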

It returns three values:

score

The true score without permuting targets.

permutation_scores

pvalue

The best possible p-value is 1/(n_permutations + 1), the worst is 1.0.
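A minimal usage sketch showing the three return values. The dataset, classifier and parameter values are assumptions for illustration. The check at the end relies on the p-value formula given in the scikit-learn documentation, (C + 1) / (n_permutations + 1), where C is the number of permutations whose score is at least the true score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
n_permutations = 100

score, permutation_scores, pvalue = permutation_test_score(
    SVC(kernel="linear"),
    X,
    y,
    cv=StratifiedKFold(n_splits=5),
    n_permutations=n_permutations,
    scoring="accuracy",
    random_state=0,
)

# score: CV score on the unpermuted targets (a float).
# permutation_scores: one CV score per permutation.
# pvalue: (C + 1) / (n_permutations + 1), recomputed here by hand.
C = int(np.sum(permutation_scores >= score))
manual_pvalue = (C + 1) / (n_permutations + 1)

print(f"score={score:.3f}")
print(f"permutation_scores shape={permutation_scores.shape}")
print(f"pvalue={pvalue:.4f}, recomputed={manual_pvalue:.4f}")
```

If no permutation reaches the true score, C is 0 and the p-value is 1/(n_permutations + 1), which is why that is the best value it can take.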

Reference paper: Permutation Tests for Studying Classifier Performance

This function implements Test 1

Has the classifier found a significant class structure, that is, a real connection between the data and the class labels? (from Section 3 of the paper)

https://scikit-learn.org/stable/modules/cross_validation.html#permutation-test-score

It provides a permutation-based p-value, which represents how likely an observed performance of the classifier would be obtained by chance.

That is, permutation_test_score provides a permutation-based p-value, which indicates how likely it is that the classifier's observed performance would be obtained by chance.

The null hypothesis in this test is that the classifier fails to leverage any statistical dependency between the features and the labels to make correct predictions on left out data.

In other words, the null hypothesis is that the classifier fails to exploit any statistical dependency between the features and the labels when making predictions on held-out data.

For reliable results n_permutations should typically be larger than 100 and cv between 3-10 folds.

A low p-value provides evidence that the dataset contains real dependency between features and labels and the classifier was able to utilize this to obtain good results.

So a low p-value is evidence that the dataset contains a real dependency between features and labels, and that the classifier was able to use it to obtain good results.

A low p-value means the null hypothesis is rejected (but note the caveat about weak structure further down).

A high p-value could be due to a lack of dependency between features and labels (there is no difference in feature values between the classes) or because the classifier was not able to use the dependency in the data.

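So a high p-value has two possible causes: either there is no dependency between features and labels (feature values do not differ between classes), or the classifier could not use the dependency that is in the data.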

In the latter case, using a more appropriate classifier that is able to utilize the structure in the data, would result in a lower p-value.

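In the latter case (the classifier cannot use the dependency in the data), a more suitable classifier that can exploit the structure in the data should yield a lower p-value.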
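The "latter case" can be illustrated with a sketch in which the dependency exists but a linear model cannot use it. The XOR-style synthetic data and the two classifiers are assumptions made up for this example, not something taken from the scikit-learn documentation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score
from sklearn.svm import SVC

# Synthetic XOR-like data: the label depends on the sign of x0 * x1,
# so the classes are not linearly separable.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

common = dict(cv=5, n_permutations=100, random_state=0)

# A linear classifier cannot represent this dependency.
score_lin, _, p_lin = permutation_test_score(
    LogisticRegression(), X, y, **common
)

# An RBF-kernel SVM can, so its p-value is expected to be lower.
score_rbf, _, p_rbf = permutation_test_score(
    SVC(kernel="rbf"), X, y, **common
)

print(f"linear: score={score_lin:.3f}, p={p_lin:.3f}")
print(f"rbf:    score={score_rbf:.3f}, p={p_rbf:.3f}")
```

A high p-value for the linear model here would not mean the data lacks structure, only that this model did not find it.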

Cross-validation provides information about how well a classifier generalizes, specifically the range of expected errors of the classifier.

However, a classifier trained on a high dimensional dataset with no structure may still perform better than expected on cross-validation, just by chance.

However, a classifier trained on a high-dimensional dataset with no structure can still, purely by chance, score better than expected under cross-validation.

This can typically happen with small datasets with less than a few hundred samples.

This typically happens with small datasets of fewer than a few hundred samples.

permutation_test_score provides information on whether the classifier has found a real class structure and can help in evaluating the performance of the classifier.

permutation_test_score tells us whether the classifier has found a real class structure, which helps when evaluating the classifier's performance.

It is important to note that this test has been shown to produce low p-values even if there is only weak structure in the data because in the corresponding permutated datasets there is absolutely no structure.

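Caveat: because the corresponding permuted datasets contain absolutely no structure, this test has been shown to yield low p-values even when the data has only weak structure.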

This test is therefore only able to show when the model reliably outperforms random guessing.

So all this test can show is whether the model reliably outperforms random guessing.
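The caveat about weak structure can be seen in a small sketch. The synthetic data below (one feature whose class means differ by half a standard deviation, 2000 samples) is an assumption chosen so that the classifier is only modestly better than chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)
n_samples = 2000

# Binary labels and a single feature whose mean shifts by only 0.5
# (in units of its standard deviation) between the two classes.
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(loc=0.5 * y, scale=1.0).reshape(-1, 1)

score, permutation_scores, pvalue = permutation_test_score(
    LogisticRegression(), X, y, cv=5, n_permutations=100, random_state=0
)

# The accuracy is far from impressive, yet the permuted scores cluster
# around chance level, so the p-value still comes out small.
print(f"score={score:.3f}")
print(f"permuted: mean={permutation_scores.mean():.3f}, "
      f"max={permutation_scores.max():.3f}")
print(f"pvalue={pvalue:.4f}")
```

A small p-value here says "better than random guessing", not "a good model".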

It is therefore only tractable with small datasets for which fitting an individual model is very fast.

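(It is computed by brute force: internally it fits (n_permutations + 1) * n_cv models.) So it is practical only for small datasets on which an individual model can be fitted very quickly.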
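The cost can be checked empirically with a sketch that counts fit() calls. CountingClassifier is a hypothetical helper written for this note, not part of scikit-learn. The count is kept on the class so that it survives the clones made internally, and the sketch assumes the default sequential execution (n_jobs left unset).

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, permutation_test_score


class CountingClassifier(DummyClassifier):
    """Hypothetical helper: a trivial classifier that counts fit() calls."""

    n_fits = 0  # class-level counter shared by all clones

    def fit(self, X, y, sample_weight=None):
        type(self).n_fits += 1
        return super().fit(X, y, sample_weight=sample_weight)


X, y = load_iris(return_X_y=True)
n_permutations, n_splits = 20, 4

permutation_test_score(
    CountingClassifier(strategy="most_frequent"),
    X,
    y,
    cv=KFold(n_splits=n_splits, shuffle=True, random_state=0),
    n_permutations=n_permutations,
    random_state=0,
)

expected_fits = (n_permutations + 1) * n_splits
print(f"fits observed={CountingClassifier.n_fits}, expected={expected_fits}")
```

With the defaults suggested earlier (more than 100 permutations, 3 to 10 folds) this already means several hundred to over a thousand fits, which is why the per-model fitting time dominates.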